APPENDIX 
Claim Set with Elected Claims 



1. In a computer system, a method for finding near identities in a DNA database, the method 
comprising the steps of: 

providing a first database and a second database; 

generating for the first database a first tag array and for the second database a second tag array; 
and 

comparing the first tag array to the second tag array using a comparison model to determine areas 
of the first database which match areas of the second database. 

2. The method of claim 1, wherein the first database is a genomic DNA sequence database 
and the second database is a genomic DNA sequence database. 

3. The method of claim 1, wherein the first database is a genomic DNA sequence database 
and the second database is a cDNA sequence database. 

4. The method of claim 1, wherein the first database is a cDNA sequence database and the 
second database is a cDNA sequence database. 

5. The method of claim 1, wherein the step of generating for the first database a first tag 
array and for the second database a second tag array further comprises steps of generating a tag 
record which contains a tag value, a value representing a sequence ID of a sequence from which 
the tag value was generated and a value representing a position on a sequence from which the tag 
value was generated and storing the tag record in an appropriate tag array. 

6. The method of claim 5, wherein the tag value is computed as 

\[ T = \sum_{i=1}^{|DNA|} f(DNA_i) \cdot 4^{(i-1)} \bmod P \]

where T is the tag value 

DNA is a fragment of a DNA sequence, 
|DNA| is a length of the DNA fragment, 

P is a prime number such that P · 4 can be stored in one computer word 
and where f(DNA_i) evaluates to 0, 1, 2, and 3 when DNA_i is A, C, G, and T respectively. 
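
For illustration only, and not as part of the claim language, the tag value computation recited above might be sketched as follows. The function name tag_value, the table BASE_CODE, and the default prime P = 2^31 - 1 (chosen on the assumption of a 64-bit computer word, so that P · 4 fits in one word) are hypothetical. The sketch also assumes that the reduction mod P applies to the whole sum.

```python
# Illustrative sketch only: names and the default prime are assumptions,
# not taken from the claims.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # f(DNA_i) as defined in the claim


def tag_value(fragment: str, p: int = 2**31 - 1) -> int:
    """Return the sum over i of f(DNA_i) * 4^(i-1), reduced mod p."""
    total = 0
    power = 1  # holds 4^(i-1) mod p, starting at i = 1
    for base in fragment:
        total = (total + BASE_CODE[base] * power) % p
        power = (power * 4) % p
    return total


if __name__ == "__main__":
    print(tag_value("ACGT"))
```

Reducing both the running total and the power of 4 at every step keeps each intermediate value small, which is one plausible reason for requiring that P · 4 fit in a single word.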

7. The method of claim 1, wherein the step of comparing the first tag array to the second tag 
array using a comparison model to determine areas of the first database which match areas of the 
second database further comprises steps of sorting the first tag array on tag value to produce a 
sorted first tag array, and sorting the second tag array on tag value to produce a sorted second tag 
array. 

8. The method of claim 7, further comprising steps of comparing each tag of each sequence 
of length l from the sorted first tag array to tags in the sorted second tag array and for those tag 
values that are equal, recording the tag values and their respective sequence ID and tag position 
on the sequence values in a matched tag array. 
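
For illustration only, the sorting and comparing steps of claims 7 and 8 might be sketched as a sort followed by a merge-style scan. The record layout (tag value, sequence ID, position), the name match_tags, and the layout of the matched tag array entries are assumptions made for this sketch.

```python
from typing import List, Tuple

# Hypothetical record layout: (tag value, sequence ID, position on sequence)
TagRecord = Tuple[int, str, int]
# Hypothetical matched entry: (tag value, ID 1, position 1, ID 2, position 2)
MatchedRecord = Tuple[int, str, int, str, int]


def match_tags(first: List[TagRecord], second: List[TagRecord]) -> List[MatchedRecord]:
    """Sort both tag arrays on tag value and record every pair of equal tags."""
    a = sorted(first, key=lambda rec: rec[0])
    b = sorted(second, key=lambda rec: rec[0])
    matched: List[MatchedRecord] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            tag = a[i][0]
            # Find the end of the run of equal tag values in each array.
            i_end = i
            while i_end < len(a) and a[i_end][0] == tag:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end][0] == tag:
                j_end += 1
            # Record every pairing of records that share this tag value.
            for rec_a in a[i:i_end]:
                for rec_b in b[j:j_end]:
                    matched.append((tag, rec_a[1], rec_a[2], rec_b[1], rec_b[2]))
            i, j = i_end, j_end
    return matched
```

Because both arrays are sorted on tag value, a single pass over each is enough to find all equal tags, apart from the pairings emitted within runs of duplicates.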

9. The method of claim 8, further comprising steps of using the matched tag array to 
calculate a match density value for a sequence, where the match density is equal to a total 
number of tags for the sequence in the matched tag array divided by a total number of tags for the 
sequence in the sorted second tag array, and using the match density to indicate sequences of near 
identity when the match density is greater than an indicator value. 
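
For illustration only, the match density of claim 9 might be computed as below. The record layouts and the names match_density and indicates_near_identity are assumptions for this sketch. The default indicator value of 0.9 is the value recited in claim 10.

```python
from typing import List, Tuple

# Hypothetical layouts, assumed for this sketch:
#   tag array entry:     (tag value, sequence ID, position)
#   matched array entry: (tag value, ID 1, position 1, ID 2, position 2)
TagRecord = Tuple[int, str, int]
MatchedRecord = Tuple[int, str, int, str, int]


def match_density(matched: List[MatchedRecord],
                  second: List[TagRecord],
                  seq_id: str) -> float:
    """Tags for seq_id in the matched array / tags for seq_id in the second array."""
    matched_count = sum(1 for rec in matched if rec[3] == seq_id)
    total_count = sum(1 for rec in second if rec[1] == seq_id)
    if total_count == 0:
        return 0.0  # no tags for this sequence, so no density can be formed
    return matched_count / total_count


def indicates_near_identity(density: float, indicator: float = 0.9) -> bool:
    """Near identity is indicated when the density exceeds the indicator value."""
    return density > indicator
```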

10. The method of claim 9, wherein the indicator value is 0.9. 

11. The method of claim 8, further comprising steps of using the matched tag array to 
calculate a mean difference offset value for a pair of sequences; wherein a set of offsets are 
differences between sequence positions associated with pairs of matching tags, and wherein a 
median offset is a median value of the set of offsets, and wherein a set of differences comprises 
differences between the median value and each of the offsets, and wherein a mean difference 
offset value is a mean value of the set of differences, whereby sequences of near identity are 
indicated when the mean difference offset value is less than an indicator value. 
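
For illustration only, the mean difference offset of claim 11 might be sketched as below. The name mean_difference_offset and the input format (pairs of positions taken from matching tags) are assumptions. The claim recites "differences" without stating a sign convention, so this sketch takes absolute differences, which is also an assumption.

```python
from statistics import mean, median
from typing import List, Tuple


def mean_difference_offset(position_pairs: List[Tuple[int, int]]) -> float:
    """Mean distance of each offset from the median offset.

    position_pairs holds (position on first sequence, position on second
    sequence) for each pair of matching tags.
    """
    # Offsets: differences between the sequence positions of matching tags.
    offsets = [p1 - p2 for p1, p2 in position_pairs]
    med = median(offsets)
    # Assumption: absolute differences, so deviations on either side of the
    # median do not cancel.
    differences = [abs(offset - med) for offset in offsets]
    return mean(differences)


if __name__ == "__main__":
    pairs = [(10, 0), (25, 15), (40, 30), (60, 45)]
    print(mean_difference_offset(pairs))
```

In this example three of the four matching tags share an offset of 10 and one has an offset of 15, so the result is small, consistent with two sequences that align at a nearly constant shift.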

12. The method of claim 11, wherein the indicator value is 25. 

13. The method of claim 11, further comprising steps of using the matched tag array to 
calculate a rank correlation coefficient for a pair of sequences, wherein the rank correlation 
coefficient is computed as 



where r is the rank correlation coefficient for a pair of sequences, 

d_i is a difference in rank of the tags, and 

m is a number of tag matches, and 
wherein sequences of near identity are indicated when the rank correlation coefficient for a pair 
of sequences is less than an indicator value. 
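
For illustration only: the formula for the rank correlation coefficient is not reproduced in the text above. The sketch below assumes it is the conventional Spearman form, r = 1 - 6 * sum(d_i^2) / (m * (m^2 - 1)), which uses the same symbols r, d_i, and m. That choice of formula, the name rank_correlation, the ranking of tags by position on each sequence, and the assumption of no tied positions are all hypothetical.

```python
from typing import List, Tuple


def _ranks(values: List[int]) -> List[int]:
    """Return 1-based ranks of values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def rank_correlation(position_pairs: List[Tuple[int, int]]) -> float:
    """Spearman-style coefficient over the positions of matching tags.

    Assumed formula: r = 1 - 6 * sum(d_i^2) / (m * (m^2 - 1)),
    where d_i is the difference in rank of the i-th matching tag on the
    two sequences and m is the number of tag matches.
    """
    m = len(position_pairs)
    if m < 2:
        raise ValueError("at least two tag matches are needed")
    first_ranks = _ranks([p1 for p1, _ in position_pairs])
    second_ranks = _ranks([p2 for _, p2 in position_pairs])
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(first_ranks, second_ranks))
    return 1 - 6 * sum_d_squared / (m * (m * m - 1))
```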

14. The method of claim 13, wherein the indicator value is 0.75. 




16. A computer system, having a processor, memory, external data storage, input/output 
mechanisms, a display, for finding near identities in a DNA database, comprising: 

a first database and a second database; 

logic mechanisms in the computer for generating for the first database a first tag array and for the 
second database a second tag array; and 

a comparing mechanism in the computer for comparing the first tag array to the second tag array 
using a comparison model to determine areas of the first database which match areas of the 
second database. 

17. The computer system of claim 16, wherein the first database is a genomic DNA sequence 
database and the second database is a genomic DNA sequence database. 

18. The computer system of claim 16, wherein the first database is a cDNA sequence 
database and the second database is a cDNA sequence database. 

19. The computer system of claim 16, wherein the first database is a genomic DNA sequence 
database and the second database is a cDNA sequence database. 

20. The computer system of claim 16, wherein the logic mechanisms for generating for the 
first database a first tag array and for the second database a second tag array further comprises 
logic mechanisms for generating a tag record which contains a tag value, a value representing a 
sequence ID of a sequence from which the tag value was generated and a value representing a 
position on a sequence from which the tag value was generated and a logic mechanism for 
storing the tag record in an appropriate tag array. 

21. The computer system of claim 20, wherein the tag value is computed as 

\[ T = \sum_{i=1}^{|DNA|} f(DNA_i) \cdot 4^{(i-1)} \bmod P \]



where T is the tag value 



DNA is a fragment of a DNA sequence, 
|DNA| is a length of the DNA fragment, 

P is a prime number such that P · 4 can be stored in one computer word 
and where f(DNA_i) evaluates to 0, 1, 2, and 3 when DNA_i is A, C, G, and T respectively. 

22. The computer system of claim 16, wherein the comparing mechanism for comparing the 
first tag array to the second tag array using a comparison model to determine areas of the first 
database which match areas of the second database further comprises logic mechanisms for 
sorting the first tag array on tag value to produce a sorted first tag array, and for sorting the 
second tag array on tag value to produce a sorted second tag array. 

23. The computer system of claim 22, further comprising logic mechanisms for comparing 
each tag of each sequence of length l from the sorted first tag array to tags in the sorted second 
tag array and for those tag values that are equal, recording the tag values and their respective 
sequence ID and tag position on the sequence values in a matched tag array. 

24. The computer system of claim 23, further comprising a logic mechanism for using the 
matched tag array to calculate a match density value for a sequence, where the match density is 
equal to a total number of tags for the sequence in the matched tag array divided by a total 
number of tags for the sequence in the sorted second tag array, and using the match density to 
indicate sequences of near identity when the match density is greater than an indicator value. 

25. The computer system of claim 24, wherein the indicator value is 0.9. 

26. The computer system of claim 23, further comprising a logic mechanism for using the 
matched tag array to calculate a mean difference offset value for a pair of sequences; wherein a 
set of offsets are differences between sequence positions associated with pairs of matching tags, 
and wherein a median offset is a median value of the set of offsets, and wherein a set of 
differences comprises differences between the median value and each of the offsets, and wherein 
a mean difference offset value is a mean value of the set of differences, whereby sequences of 
near identity are indicated when the mean difference offset value is less than an indicator value. 

27. The computer system of claim 26, wherein the indicator value is 25. 

28. The computer system of claim 23, further comprising a logic mechanism for using the 
matched tag array to calculate a rank correlation coefficient for a pair of sequences, wherein the 
rank correlation coefficient is computed as 




where r is the rank correlation coefficient for a pair of sequences, 

d_i is a difference in rank of the tags, and 

m is a number of tag matches, and 
wherein sequences of near identity are indicated when the rank correlation coefficient for a pair 
of sequences is less than an indicator value. 



29. The computer system of claim 28, wherein the indicator value is 0.75. 



